Abstract. Cluster analysis often serves as the initial step in the process
of data classification. In this paper, the problem of clustering input data
of different lengths is considered. The edit distance, defined as the minimum
number of elementary edit operations needed to transform one vector into
another, is used. A heuristic for clustering unequal length vectors,
analogous to the well known k-means algorithm, is described and analyzed.
This heuristic determines cluster centroids by expanding shorter vectors to
the length of the longest one in each cluster in a specific way. It is shown
that the time and space complexities of the heuristic are linear in the
number of input vectors. Experimental results on real data originating
from a system for classification of Web attacks are given.



1 Introduction 

Clustering can be informally defined as the process of grouping objects that
are similar in some way, where the number of groups may be unknown. The
process of classification often begins with clustering of a data subset, in order to
determine the initial categories. The clusters obtained in such a way are then
used to categorize the rest of the available data. The methods of cluster analysis of
equal length vectors have been widely treated in the literature (see, for example,
[2, 4, 5, 7]). For such clustering, the classical distance measures can be used, such
as the well known Hamming distance. However, if unequal length vectors are
to be clustered, it is necessary to introduce new distance measures, since the
classical ones cannot be used in such cases [11].

Two major groups of clustering methods exist: hierarchical (agglomerative),
in which each group of size greater than one is composed of smaller groups, and
non-hierarchical (partitioning), in which every object is assigned to exactly one
group.

In this paper, a non-hierarchical heuristic for clustering vectors of different
lengths, analogous to the k-means procedure of MacQueen [8], is described and
analyzed. The unconstrained edit distance, defined as the minimum number of
elementary edit operations (deletions and substitutions) needed to transform one
vector into another, is used as the distance measure. The essence of the heuristic is
the method of generating the new centroids. This is performed by expanding the
shorter vectors of each current cluster to the length of the longest one in that
cluster and manipulating the numbers of symbol occurrences at each coordinate
of these expanded vectors. The time and memory complexities of the heuristic
are analyzed, and experimental results on artificial as well as real data (the
encodings of some Web attacks) are given.

The paper is organized as follows. Section 2 gives some preliminaries about
clustering in general and the unconstrained edit distance. In Section 3, the new
heuristic for clustering vectors of different lengths, analogous to the k-means
algorithm, is described, together with the possible variants, obtained by modifying
the way of manipulating the numbers of symbol occurrences at the coordinates of
expanded vectors. In Section 4, the time and space complexities of the heuristic
are analyzed. Finally, in Section 5, the experimental results on random as well
as real samples are given.

2 Preliminaries 

In this paper, we consider the following problem:

Let $P$ be a set of vectors, whose cardinality is $m$, and whose elements are
$p_1, \dots, p_m$, of dimensions $n_1, \dots, n_m$, respectively. The task of cluster
analysis of such vectors is: partition the set $P$ into $k$ nonempty subsets
$P_1, \dots, P_k$, such that the following holds:

$$P_1 \cup P_2 \cup \dots \cup P_k = P \qquad (1)$$

$$P_i \cap P_j = \emptyset, \quad i, j = 1, 2, \dots, k, \quad i \neq j, \qquad (2)$$

optimizing some of the partition criteria. The number k may be given in advance,
but it need not be. The partition criterion that is usually used is the
sum-of-squares criterion.
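
Conditions (1) and (2), together with the requirement that the subsets be nonempty, can be checked mechanically. The following is a minimal sketch, assuming that the vectors are referred to by their indices 0, ..., m-1 (so that repeated vectors stay distinguishable); the function name `is_partition` is illustrative and does not come from the paper.

```python
def is_partition(clusters, universe):
    """Return True if `clusters` is a partition of `universe` into nonempty subsets.

    clusters -- iterable of collections of vector indices (the subsets P_1..P_k)
    universe -- iterable of all vector indices (the set P)
    """
    clusters = [set(c) for c in clusters]
    # Every subset must be nonempty.
    if any(not c for c in clusters):
        return False
    union = set().union(*clusters)
    # Condition (1): the union of the subsets is the whole set.
    covers = union == set(universe)
    # Condition (2): pairwise disjoint, i.e. no index is counted twice.
    disjoint = sum(len(c) for c in clusters) == len(union)
    return covers and disjoint
```

For example, `is_partition([{0, 1}, {2}], range(3))` holds, whereas `[{0, 1}, {1, 2}]` violates (2) because index 1 belongs to two subsets.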

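Let $p_r^i$ and $p_s^i$ be the r-th and s-th element of the subset $P_i$ of the set $P$,
$i \in \{1, \dots, k\}$, $r, s \in \{1, \dots, |P_i|\}$, $r \neq s$. Let $d(p_r^i, p_s^i)$ be the
distance between $p_r^i$ and $p_s^i$ defined in some way. Then the sum-of-squares
criterion is given by the following expression:

[Expression (3) is not legible in the source; only the limits of its inner sum,
$r, s = 1, \dots, |P_i|$ with $r \neq s$, survive.]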

It is well known (see for example [6]) that the problem of minimization of the
sum-of-squares criterion is NP-hard. However, there are many algorithms that
find a local minimum of this criterion. For example, the well known k-means
algorithm [8] is of this kind.

Let X and Y be vectors of lengths N and M, respectively, whose coordinates
take values from the discrete alphabet A. The edit distance between the
vectors X and Y is defined as the minimum number of elementary edit operations
(substitutions, deletions, and/or insertions) needed to transform X into Y.
Certain constraints can be incorporated into this definition, which can concern
the total number of elementary edit operations, the maximum deletion and/or
insertion run lengths, the total number of deletion and/or insertion runs, etc.
Combinations of these constraints are also possible [10, 11]. In this paper,
the unconstrained edit distance, where the elementary edit operations are
substitutions and deletions, is used, as it models well the possible transformations
of the input vectors.

Nonnegative real-valued elementary edit distances are associated with the
corresponding elementary edit operations:

1. $d(x, \phi)$ is the elementary distance associated with the deletion of $x \in A$ from
the vector X, where the 'empty' symbol $\phi$ is introduced to represent deletion;

2. $d(x, y)$ is the elementary distance associated with the substitution of $x$ by
$y$, $x, y \in A$. Usually, $d(x, x) = 0$, $\forall x$. If $d(x, y) \neq 0$, then the corresponding
substitution is called the effective substitution.

In order to define the explicit expression for the edit distance,
an edit transformation can be represented sequentially. Namely, we define a
2-dimensional edit sequence $\mathcal{E} = ([\alpha], [\beta])$ over the alphabet
$\{0, 1, \phi\}$ by the following encoding scheme.

First, for an arbitrary vector $G$ over $A$, let $\gamma$ denote any vector over
$\{0, 1, \phi\}$ such that by removing all the 'empty' symbols from $\gamma$ one obtains $G$.
Then, an edit sequence $([\alpha], [\beta])$ is defined by applying the following rules:

1. The lengths of $[\alpha]$ and $[\beta]$ are equal to $N$.

2. If $\alpha(i)$ and $\beta(i)$ are non-empty symbols, then the substitution of the symbol
$\alpha(i)$ by the symbol $\beta(i)$ takes place, for any $1 \le i \le N$.

3. If $\alpha(i)$ is not the 'empty' symbol, and $\beta(i)$ is the 'empty' symbol, then the
deletion of $\alpha(i)$ takes place, for any $1 \le i \le N$.

4. For any $1 \le i \le N$ no other cases apart from 2. and 3. are allowed.

There is a one-to-one correspondence between the set of all the permitted
edit sequences $([\alpha], [\beta])$ defined as above, denoted by $\Gamma(X, Y)$, and the set of all
the permitted edit transformations of X into Y.

The edit distance can be expressed in terms of edit sequences by:

$$d(X, Y) = \min_{([\alpha], [\beta]) \in \Gamma(X, Y)} \sum_{i=1}^{N} d(\alpha(i), \beta(i)). \qquad (4)$$

Example: Let $X = (1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1)$ and $Y = (1, 0, 1, 1, 1, 1, 0)$. An
edit sequence that corresponds to a permitted edit transformation is given by

$\mathcal{E}$ = [edit sequence display not legible in the source]

Assuming that the elementary distances associated with deletions and effective
substitutions are all equal to one, by using (4), one can determine that the
edit distance corresponding to this edit sequence is 6.
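
The cost that (4) assigns to a single edit sequence can be evaluated directly. The sketch below assumes unit elementary distances and represents the 'empty' symbol by `None`; the two sequences `beta_a` and `beta_b` are illustrative choices made here for the vectors X and Y of the example, and it cannot be confirmed that either coincides with the sequence printed in the paper.

```python
EMPTY = None  # stands for the 'empty' symbol phi


def unit_cost(a, b):
    """Elementary substitution distance: 0 for equal symbols, 1 otherwise."""
    return 0 if a == b else 1


def is_permitted(alpha, beta, x, y):
    """Check that alpha equals X and that beta is Y padded with EMPTY to length N."""
    if len(alpha) != len(x) or len(beta) != len(x):
        return False
    if list(alpha) != list(x):
        return False
    return [b for b in beta if b is not EMPTY] == list(y)


def sequence_cost(alpha, beta, d_del=1, d_sub=unit_cost):
    """Sum the elementary distances of one edit sequence (the sum inside (4))."""
    return sum(d_del if b is EMPTY else d_sub(a, b) for a, b in zip(alpha, beta))


X = (1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1)
Y = (1, 0, 1, 1, 1, 1, 0)

# Illustrative sequence A: substitute the first seven symbols, delete the last four.
beta_a = (1, 0, 1, 1, 1, 1, 0, EMPTY, EMPTY, EMPTY, EMPTY)
# Illustrative sequence B: delete the fifth symbol and the last three.
beta_b = (1, 0, 1, 1, EMPTY, 1, 1, 0, EMPTY, EMPTY, EMPTY)

cost_a = sequence_cost(X, beta_a)  # 4 deletions + 2 effective substitutions
cost_b = sequence_cost(X, beta_b)  # 4 deletions, no effective substitution
```

Sequence A has cost 6 and sequence B has cost 4, which shows that a permitted edit sequence need not be an optimal one; the edit distance is the minimum over all permitted sequences.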



The unconstrained edit distance can be calculated recursively, by filling the
matrix of partial edit distances [9]. Let X and Y be vectors of lengths N and M,
respectively, over the finite alphabet A, and let $x_j$ and $y_j$ denote their j-th
coordinates. Let e be the number of deletions and let s be the number of
substitutions in an edit transformation of the prefix $X_{e+s}$ of X to the prefix
$Y_s$ of Y. Let $d(x, y)$ be the elementary edit distance associated
with the substitution of the symbol x by the symbol y and let $d(x, \phi)$ be the
elementary edit distance associated with the deletion of the symbol x from X.
Then the partial edit distance $W[e, s]$ between $X_{e+s}$ and $Y_s$ can be computed
following the same lines as in [9]:

$$W[e, s] = \min\{\, W[e-1, s] + d(x_{e+s}, \phi),\; W[e, s-1] + d(x_{e+s}, y_s) \,\}, \quad e = 1, \dots, N, \quad s = 1, \dots, \min\{N - e, M\}. \qquad (5)$$

In the sequel, it will be assumed that:

1. $d(x, \phi) = d_e$, $\forall x \in A$.

2. $d(x, x) = 0$, $\forall x \in A$.

The following algorithm implements the relation (5) in order to determine
the edit distance between the given vectors of different lengths.

Algorithm 1
INPUT:

   The vectors X and Y of lengths N and M, respectively (N >= M).
   Elementary edit distance de associated with the deletion of a symbol from X.
   Elementary edit distance d(x,y) associated with the substitution of the
   symbol x by the symbol y, for all x, y.
OUTPUT:

   The matrix W of partial edit distances associated with the transformations
   of the prefixes of the vector X to the corresponding prefixes of the vector Y.
// Initialization:
W[0][0] = 0 ;

// The column of the matrix W:
for e = 1 to N
    W[e][0] = W[e-1][0] + de ;
// The row of the matrix W:
for s = 1 to M
    W[0][s] = W[0][s-1] + d(X[s],Y[s]) ;
// Main loop:
for e = 1 to N
    for s = 1 to min(N-e,M)
        W[e][s] = min(W[e-1][s] + de , W[e][s-1] + d(X[e+s],Y[s])) ;
// Calculate the edit distance:
d(X,Y) = W[N-M][M] .
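
The recursion (5) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: indices are 0-based, unit elementary distances are the default, and only the rows e = 0, ..., N-M are filled, since the final value is read from W[N-M][M]. The names `edit_distance_matrix` and `edit_distance` are illustrative.

```python
def unit_cost(a, b):
    """Elementary substitution distance: 0 for equal symbols, 1 otherwise."""
    return 0.0 if a == b else 1.0


def edit_distance_matrix(x, y, d_del=1.0, d_sub=unit_cost):
    """Fill the matrix W of partial edit distances (deletions and substitutions only).

    W[e][s] is the distance between the prefix of x of length e+s and the
    prefix of y of length s, using e deletions and s substitutions.
    """
    n, m = len(x), len(y)
    if n < m:
        raise ValueError("x must be at least as long as y (no insertions allowed)")
    rows = n - m + 1  # e ranges over 0..N-M
    w = [[0.0] * (m + 1) for _ in range(rows)]
    # First column: e deletions, no substitutions.
    for e in range(1, rows):
        w[e][0] = w[e - 1][0] + d_del
    # First row: s substitutions, no deletions.
    for s in range(1, m + 1):
        w[0][s] = w[0][s - 1] + d_sub(x[s - 1], y[s - 1])
    # Main loop: the last operation is either a deletion or a substitution.
    for e in range(1, rows):
        for s in range(1, m + 1):
            w[e][s] = min(
                w[e - 1][s] + d_del,
                w[e][s - 1] + d_sub(x[e + s - 1], y[s - 1]),
            )
    return w


def edit_distance(x, y, d_del=1.0, d_sub=unit_cost):
    """Unconstrained edit distance d(X, Y) = W[N-M][M]."""
    return edit_distance_matrix(x, y, d_del, d_sub)[-1][-1]
```

Restricting e to N-M keeps the matrix at (N-M+1) by (M+1) cells; for the vectors X and Y of the example above, every transformation needs exactly four deletions, and the function returns 4.0 because no effective substitution is required.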



It is possible to reconstruct one of the possible optimal edit sequences, either
by backtracking through the matrix of partial edit distances W or by maintaining
pointers to the cells of W during the execution of Algorithm 1. The
reconstructed sequence is not unique in general. The following algorithm
reconstructs one of the possible optimal edit sequences that transforms the vector
X of length N into the vector Y of length M, by backtracking through the
matrix W.

Algorithm 2

INPUT:

   The vectors X and Y of lengths N and M, respectively.
   Elementary edit distance de associated with the deletion of a symbol from X.
   The matrix W of partial edit distances between the prefixes of the vector X
   and the corresponding prefixes of the vector Y.
OUTPUT:

   One of the optimal edit sequences ([alpha], [beta]) that transforms X into Y.
// Initialization:
e = N-M ; s = M ;

L = 0 ; // The length of the edit sequence
// Backtracking through the matrix W:
while ((e>0) or (s>0)){

    // The test e>0 prevents reading W[e-1][s] outside the matrix
    if ((e>0) and (W[e][s] == W[e-1][s] + de)){
        L++ ;

        alpha[L] = X[e+s] ;
        beta[L] = phi ; // deletion: the 'empty' symbol
        e-- ;

    }

    else{
        L++ ;

        alpha[L] = X[e+s] ;
        beta[L] = Y[s] ;
        s-- ;

    }

}
// alpha and beta are filled from the last coordinate of X to the first;
// reverse them to read the edit sequence in the order of X.
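
The backtracking step can be sketched in Python as shown below. The sketch assumes 0-based indices, unit elementary distances and `None` as the 'empty' symbol; it returns only the sequence [beta], since [alpha] equals X when insertions are not allowed. The helper names are illustrative.

```python
EMPTY = None  # stands for the 'empty' symbol phi


def unit_cost(a, b):
    return 0.0 if a == b else 1.0


def edit_distance_matrix(x, y, d_del=1.0, d_sub=unit_cost):
    """Matrix W of partial edit distances, rows e = 0..N-M, columns s = 0..M."""
    rows, m = len(x) - len(y) + 1, len(y)
    if rows < 1:
        raise ValueError("x must be at least as long as y")
    w = [[0.0] * (m + 1) for _ in range(rows)]
    for e in range(1, rows):
        w[e][0] = w[e - 1][0] + d_del
    for s in range(1, m + 1):
        w[0][s] = w[0][s - 1] + d_sub(x[s - 1], y[s - 1])
    for e in range(1, rows):
        for s in range(1, m + 1):
            w[e][s] = min(w[e - 1][s] + d_del,
                          w[e][s - 1] + d_sub(x[e + s - 1], y[s - 1]))
    return w


def expand(x, y, d_del=1.0, d_sub=unit_cost):
    """Return [beta]: y expanded to the length of x, with EMPTY at deleted positions."""
    w = edit_distance_matrix(x, y, d_del, d_sub)
    e, s = len(x) - len(y), len(y)
    beta = []
    while e > 0 or s > 0:
        # Prefer a deletion whenever it explains the value of W[e][s];
        # the test e > 0 keeps the index e-1 inside the matrix.
        if e > 0 and w[e][s] == w[e - 1][s] + d_del:
            beta.append(EMPTY)
            e -= 1
        else:
            beta.append(y[s - 1])
            s -= 1
    beta.reverse()  # the walk runs from the last coordinate of x to the first
    return beta
```

For instance, `expand("abcd", "bd")` yields `[None, 'b', None, 'd']`. When several optimal sequences exist, the preference for deletions during the walk decides which one is returned.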

3 The new heuristic 

In this section we describe the new heuristic for clustering unequal length vectors,
analogous to the k-means algorithm.

Let $P$ be a set of vectors, whose cardinality is $m$, and whose elements are
$p_1, \dots, p_m$, of dimensions $n_1, \dots, n_m$, respectively. Let the coordinates of the
vectors take values from the discrete finite alphabet $A$. Let $k$ be the number of
clusters given in advance. The heuristic starts from an arbitrary initial partition
of $P$ into $k$ clusters, $P_1, \dots, P_k$. Since the vectors in the clusters are of different
lengths, the heuristic expands the shorter vectors in every cluster to the length
of the longest one in the same cluster. This is carried out by means of the
optimal edit sequences that transform the longest vector in the cluster to each
of the remaining vectors in it. Then the coordinates of the cluster's centroid
are calculated by counting the symbols at the corresponding coordinates of the
expanded vectors and selecting the most frequent symbol at each coordinate to
be the symbol at the corresponding coordinate of the new centroid. Note that
the obtained centroid is expanded in general, since it can contain empty symbols.
The final new centroid of the cluster is obtained by removing the empty symbols
from the expanded centroid.
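
The coordinate-wise vote that produces the centroid can be sketched as follows, assuming that the expanded vectors are already available as equal-length lists in which `None` plays the role of the empty symbol. The function name is illustrative, and ties are resolved here by `Counter.most_common`, which returns the symbol encountered first among equally frequent ones; this is only one of several possible tie-breaking choices.

```python
from collections import Counter

EMPTY = None  # stands for the empty symbol


def centroid_from_expanded(expanded):
    """Return (expanded centroid, final centroid) for equal-length expanded vectors."""
    if len({len(v) for v in expanded}) != 1:
        raise ValueError("all expanded vectors must have the same length")
    # Most frequent symbol at every coordinate (the empty symbol takes part in the vote).
    voted = [Counter(column).most_common(1)[0][0] for column in zip(*expanded)]
    # The final centroid is obtained by dropping the empty symbols.
    final = [symbol for symbol in voted if symbol is not EMPTY]
    return voted, final


expanded_vectors = [
    [1, 0, EMPTY, 1],
    [1, EMPTY, EMPTY, 1],
    [0, 0, 1, 1],
]
expanded_centroid, centroid = centroid_from_expanded(expanded_vectors)
```

In this small example the empty symbol prevails at the third coordinate, so the expanded centroid is `[1, 0, None, 1]` and the final centroid `[1, 0, 1]` is shorter than the expanded vectors.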

The new clusters are created by finding the minimum value of the unconstrained
edit distance between each member of $P$ and the new centroids. The
process continues until the new clusters are equal to the previous ones.

The following is the formal description of the heuristic. 

Algorithm 3:

INPUT:

   The set $P$ of vectors $p_1, \dots, p_m$ of lengths $n_1, \dots, n_m$, respectively.

   The number of clusters $k$.

OUTPUT:

   A partition of the set $P$ into $k$ clusters.
// Initialization:
terminate = false ;

Select the initial centroids $C_1, \dots, C_k$. These could be, for example, any $k$
vectors from the input set chosen at random.

Calculate the unconstrained edit distance between every vector from $P$ and
every centroid from the set $C = \{C_1, \dots, C_k\}$, using Algorithm 1. Assign
every vector from $P$ to the nearest centroid. In such a way, the initial clustering
$P_1, \dots, P_k$ is obtained.

// Main loop:

while (not terminate){

   // Calculate the new centroids:

   Let $p_s^i$ be the longest vector in the cluster $P_i$, $i = 1, \dots, k$. Let $p_r^i$,
   $r \in \{1, \dots, |P_i|\} \setminus \{s\}$, be the other elements of the cluster $P_i$.
   For each $r \in \{1, \dots, |P_i|\} \setminus \{s\}$, find an optimal edit sequence
   $([\alpha_r^i], [\beta_r^i])$ that transforms $p_s^i$ to $p_r^i$, using Algorithm 1 and
   Algorithm 2. The vector $[\beta_r^i]$ is the expanded vector $p_r^i$.

   Find the symbol that prevails at every coordinate of the expanded vectors
   $[\beta_r^i]$, $r \in \{1, \dots, |P_i|\} \setminus \{s\}$. Make this symbol the new value of the
   corresponding coordinate of the new expanded centroid $\tilde{C}_i'$ of the cluster
   $P_i$.

   Remove the empty symbols from the expanded centroid $\tilde{C}_i'$. In such a
   way, the new centroids $C_i'$, $i = 1, \dots, k$, are obtained.

   // Reassign the input vectors to the new centroids:

   Assign every input vector from $P$ to the nearest centroid from the set
   $C' = \{C_1', \dots, C_k'\}$, by calculating the edit distance between the vectors
   and the new centroids, using Algorithm 1. Thus the new clustering
   $P_1', \dots, P_k'$ is obtained.

   // Check if the new clustering is equal to the previous one:
   if ((P1,...,Pk) == (P1',...,Pk'))

      terminate = true ;
   else{

      (P1,...,Pk) = (P1',...,Pk') ;

      C = C' ;

   }

}
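
The whole loop can be sketched compactly in Python. This is a sketch under explicit assumptions, not the authors' implementation: unit elementary distances are used; the distance between a vector and a centroid is computed by transforming the longer of the two into the shorter; the longest vector of a cluster also takes part in the vote (so that a cluster with a single member still has a centroid); an empty cluster keeps its previous centroid; ties are resolved by first occurrence; and a cap `max_iter` on the number of iterations is added as a safeguard.

```python
from collections import Counter

EMPTY = None  # stands for the empty symbol


def sub_cost(a, b):
    return 0.0 if a == b else 1.0


def matrix(x, y, d_del=1.0):
    """Partial edit distances W[e][s] for len(x) >= len(y)."""
    rows, m = len(x) - len(y) + 1, len(y)
    w = [[0.0] * (m + 1) for _ in range(rows)]
    for e in range(1, rows):
        w[e][0] = w[e - 1][0] + d_del
    for s in range(1, m + 1):
        w[0][s] = w[0][s - 1] + sub_cost(x[s - 1], y[s - 1])
    for e in range(1, rows):
        for s in range(1, m + 1):
            w[e][s] = min(w[e - 1][s] + d_del,
                          w[e][s - 1] + sub_cost(x[e + s - 1], y[s - 1]))
    return w


def distance(a, b):
    """Edit distance with the longer vector transformed into the shorter one."""
    x, y = (a, b) if len(a) >= len(b) else (b, a)
    return matrix(x, y)[-1][-1]


def expand(x, y, d_del=1.0):
    """Expand y to the length of x by backtracking through W."""
    w = matrix(x, y, d_del)
    e, s, beta = len(x) - len(y), len(y), []
    while e > 0 or s > 0:
        if e > 0 and w[e][s] == w[e - 1][s] + d_del:
            beta.append(EMPTY)
            e -= 1
        else:
            beta.append(y[s - 1])
            s -= 1
    return beta[::-1]


def centroid(members):
    """Vote coordinate-wise over the expanded members, then drop empty symbols."""
    longest = max(members, key=len)
    expanded = [expand(longest, p) for p in members]
    voted = [Counter(col).most_common(1)[0][0] for col in zip(*expanded)]
    return tuple(sym for sym in voted if sym is not EMPTY)


def assign(vectors, centroids):
    """Index of the nearest centroid for every vector."""
    return [min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
            for p in vectors]


def cluster_vectors(vectors, initial_centroids, max_iter=100):
    centroids = list(initial_centroids)
    labels = assign(vectors, centroids)
    for _ in range(max_iter):
        for i in range(len(centroids)):
            members = [p for p, lab in zip(vectors, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = centroid(members)
        new_labels = assign(vectors, centroids)
        if new_labels == labels:  # clustering unchanged: stop
            break
        labels = new_labels
    return labels, centroids
```

A small usage example: with the vectors `(0,0,0,0)`, `(0,0,0)`, `(0,0,0,0,0)`, `(1,1,1,1)`, `(1,1,1)`, `(1,1,1,1,1)` and the initial centroids `(0,0,0,0)` and `(1,1,1,1)`, the call `cluster_vectors(...)` separates the all-zero vectors from the all-one vectors and stops after the first iteration, because the reassignment does not change the clustering.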

In Algorithm 3, the new centroids of the clusters are obtained by counting
the symbols at the coordinates of the expanded vectors and selecting the
symbol that prevails. There might, however, be cases in which all the possible
symbols occur an equal number of times at one of the coordinates. Possible
solutions of the problem in such cases are:

1. Choose the symbol at random, among those present at the coordinate;

2. Choose the symbol whose position in the alphabet is the closest to the first
symbol;

3. Choose the symbol whose position in the alphabet is the closest to the last
symbol;

4. Choose the empty symbol, if present at the coordinate.

The least biased results are obtained by selecting the symbol at
those coordinates at random. But by using the other variants, one can fine-tune the
heuristic, favouring various types of new centroids. For example, by choosing the
empty symbol in such cases, the shorter centroids are favoured, since the empty
symbols are dropped from the expanded centroids.
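
The four variants can be sketched as a single selection function. The sketch assumes that a tie is any situation in which more than one symbol attains the highest count at a coordinate, that `None` is the empty symbol, and that the alphabet is given as an ordered sequence; the rule names and the fallback to the "first" rule when the empty symbol is not among the tied symbols are choices made here for illustration.

```python
import random
from collections import Counter

EMPTY = None  # stands for the empty symbol


def pick_symbol(column, alphabet, rule="random", rng=random):
    """Select the prevailing symbol at one coordinate, resolving ties by `rule`.

    rule -- "random", "first", "last" or "empty" (variants 1 to 4 in the text)
    """
    counts = Counter(column)
    top = max(counts.values())
    tied = [sym for sym in counts if counts[sym] == top]
    if len(tied) == 1:
        return tied[0]  # no tie: the most frequent symbol wins
    if rule == "random":
        return rng.choice(tied)
    if rule == "empty" and EMPTY in tied:
        return EMPTY
    # Order the non-empty tied symbols by their position in the alphabet.
    ranked = sorted((sym for sym in tied if sym is not EMPTY), key=alphabet.index)
    return ranked[-1] if rule == "last" else ranked[0]
```

With the column `[0, 1, None, 1, 0, None]` over the alphabet `(0, 1)`, the rule "first" selects 0, "last" selects 1 and "empty" selects the empty symbol, which then shortens the final centroid by one coordinate.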

It is easy to see that the coordinates of the input vectors can take values from
different alphabets. This can be of particular importance in applications.
In that case, the elementary edit distances can be defined (although it is not
obligatory) in such a way that the substitutions of the symbols from the same
alphabet are favoured.
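
One way to favour substitutions within the same alphabet is to make them cheaper than substitutions across alphabets. The sketch below is purely illustrative: the cost values 1.0 and 2.0, the mapping `alphabet_of` and the example symbols are assumptions made here and are not taken from the paper or from the encoding in [1].

```python
def make_sub_cost(alphabet_of, same=1.0, cross=2.0):
    """Build an elementary substitution distance d(x, y) for symbols drawn
    from several alphabets.

    alphabet_of -- dict mapping every symbol to the name of its alphabet
    same        -- cost of an effective substitution within one alphabet
    cross       -- cost of a substitution between different alphabets
    """
    def d(x, y):
        if x == y:
            return 0.0  # d(x, x) = 0, as assumed in Section 2
        return same if alphabet_of[x] == alphabet_of[y] else cross
    return d


# Hypothetical symbols from two hypothetical alphabets.
alphabet_of = {"GET": "method", "POST": "method", "80": "port", "443": "port"}
d = make_sub_cost(alphabet_of)
```

Here `d("GET", "POST")` is 1.0 while `d("GET", "80")` is 2.0, so an optimal edit sequence prefers to align symbols of the same kind.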

4 The complexity of the heuristic 

It can be shown (see for example [2]) that both time and space complexities of
the k-means algorithm are $O(m)$, where $m$ is the number of input vectors.

The new heuristic described in this paper basically consists of the same
steps as the k-means algorithm. Let $m$ be the number of vectors of dimensions
$n_1, \dots, n_m$ in the input data set of the heuristic. Let $k$ be the number of clusters
and let $R$ be the number of iterations (i.e. the number of times new centroids
are calculated) of the heuristic. In the i-th iteration, let $m_1^i, \dots, m_k^i$ be the
cardinalities of the clusters, $i = 1, \dots, R$. In every iteration, prior to determining
the symbols that prevail at the coordinates of the expanded vectors in the
cluster, these expanded vectors must be obtained by means of Algorithms
1 and 2. The complexity of these algorithms is quadratic in the length of the
input sequences [9]. Let $n_{\max}$ be the length of the longest vector in the input
data set. Then the number of operations needed to expand all the vectors in
all the clusters to the length of the longest vector of their corresponding cluster
is $\sim m n_{\max}^2$, since the elementary edit operations are only deletions and
substitutions, thus making the lengths of all the edit sequences $\le n_{\max}$.

To determine the new centroids starting from the expanded vectors,
$\sim (m_1^i + \dots + m_k^i) n_{\max}$ operations are needed in each iteration, $i = 1, \dots, R$.
Since by definition the clusters are mutually exclusive, the number of operations
needed to determine the new centroids of all the clusters is $\sim m n_{\max}$.

The number of operations needed to assign all the input vectors to the nearest
centroid is $\sim m k n_{\max}^2$. Thus the total number of operations of the heuristic is
$\sim R m (n_{\max}^2 (1 + k) + n_{\max})$, which means that, for fixed $k$, $n_{\max}$ and $R$, the
time complexity of the heuristic is $O(m)$.
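
The operation count above can be evaluated numerically to see the linear dependence on m. The function below simply adds the three contributions named in the text (expansion, voting, reassignment); the concrete parameter values in the usage lines are taken from the sizes mentioned in Section 5, except for the number of iterations, which is set to 1 here as an assumption.

```python
def operation_estimate(m, k, n_max, iterations):
    """Approximate number of operations: R * m * (n_max^2 * (1 + k) + n_max)."""
    expand_cost = m * n_max ** 2      # obtaining the expanded vectors
    vote_cost = m * n_max             # counting symbols at every coordinate
    assign_cost = m * k * n_max ** 2  # distance of every vector to every centroid
    return iterations * (expand_cost + vote_cost + assign_cost)


# 2000 vectors of length at most 20 in 2 clusters, one iteration (assumed).
small = operation_estimate(m=2000, k=2, n_max=20, iterations=1)
# Doubling the number of vectors doubles the estimate.
large = operation_estimate(m=4000, k=2, n_max=20, iterations=1)
```

With these values `small` equals 2,440,000 and `large` is exactly twice as large, while the dependence on `n_max` is quadratic; the linear bound therefore relies on the vector length being bounded.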

In [2] the convergence properties of the k-means algorithm have been studied
experimentally and the conclusion was given that the expected number of
iterations was very small. Having in mind the essential similarity between this
heuristic and the k-means algorithm, similar results to those from [2] concerning
the convergence can be expected.

The storage needed by the heuristic is $\sim n_{\max} m$ memory cells to store the
input vectors, and $\sim n_{\max}^2$ memory cells needed to store the matrix of partial
edit distances. Thus the space complexity of the heuristic is $O(m)$.



5 Experimental results 

The heuristic described in this paper was tested in two ways: 

1. A set of artificially generated discrete vectors was clustered by means of the
heuristic. 1000 random samples were generated, consisting of 2000 vectors
of length at most 20, naturally grouped into 2 clusters. In 40% of these
examples, there was no overlapping between clusters at all. In 30% of the
examples there was 10% of overlapping between clusters. Finally, in 30%
of the examples there was 20% of overlapping between clusters. The number
of incorrectly clustered vectors was assigned to categories. The results
obtained with the heuristic are presented in Fig. 1. As can be seen,
the cases without overlapping were resolved correctly, whereas in the cases
with overlapping the results obtained with the heuristic depended directly
on the overlapping degree. This behaviour of the heuristic was similar to
the behaviour of the k-means heuristic, which is known to be sensitive to
overlapping [3].

2. A data set originating from a system intended to classify the
attacks on a Web server into a number of categories was clustered by means of
the heuristic. In this system, there is a need to cluster a substantial number
of vectors of different lengths that describe the security alerts. The encoding
scheme of these alerts is given in [1]. The input vectors have discrete
coordinates that generally take values from different alphabets. The encoding
given in [1] recognizes 9 properties, which means that the maximal length of
a vector from the input data set can be 9. But the number of input vectors
can be very large. The correctness of the clustering was checked against the
values of the "severity" indicator of the "SNORT" intrusion detection system
[12]. 1000 samples originating from the system were tested, consisting of
5000 vectors of length at most 9, where the number of clusters (according to
the "severity" value defined in the "SNORT" system) varied between 2 and
4. The number of incorrectly clustered vectors was assigned to categories.
The results obtained with the heuristic are presented in Fig. 2. As can
be seen, the input vectors were clustered correctly in approximately 75% of
the given cases.

Fig. 2 - Results of clustering vectors from a real system



6 Conclusion 



In this paper, a new heuristic for clustering unequal length vectors with discrete
coordinates is described and analyzed. The heuristic is analogous to the k-means
algorithm for clustering equal length real vectors. It expands the member vectors
of the clusters to the lengths of the longest vectors in those clusters, in order
to determine the new centroids in each iteration. It was shown that both time
and space complexities of the heuristic are linear in the number of input vectors.
The experimental results show that the behaviour of the heuristic is similar to
that of the k-means algorithm with respect to overlapping of the clusters, and that
the level of correctness of clustering with the heuristic is promising.